STATISTICS OF REPETITIONS

In order to be able to obtain reliable estimates of the value of
given repeats we need to have information about repetition in plain
language. Suppose for example that we have placed two messages
together and that we find repetitions consisting of a tetragramme,
two bigrammes, and fifteen single letters, and that the total overlap
was 105, i.e. that the maximum possible number of repetitions which
could be obtained by altering letters of the messages is 105; suppose
also that the lengths of the messages are 200 and 250; in such a case
what is the probability of the fit being right, no other information
about the day’s traffic being taken into consideration, but
information about the character of the unenciphered text being
available in considerable quantity?


In theory this can be solved as follows. We take a vast number of
typical decodes, say $10^{10}$, and from them we select all of length
200 and all of length 250. We encipher all of these messages at all
possible positions on the machine (neglecting for simplicity the
complications due to different daily keys). We then compare each
message 200 long with each 250 long in such a way as to get an
overlap of 105 as with the fit under consideration. From the
resulting comparisons we pick out just those cases where the
repetitions have precisely the same form as in the case in question.


This set of comparisons will be called the relevant comparisons.
Among the relevant comparisons there will be some which are right
comparisons, i.e. where corresponding letters of the two messages
were enciphered with the same position of the machine. The
probability that our original fit was right can now be expressed in
the form:

$$\frac{\text{Number of right relevant comparisons}}{\text{Total number of relevant comparisons}}$$

The work involved in this theoretical method can be vastly reduced if
we make a few harmless assumptions. In the first place if we assume
that the encipherment keys at the various positions of the machine
are hatted (i.e. random) we can calculate the number of relevant
wrong comparisons. Suppose the total number of repeated letters in
the case in question is R and the overlap is L, then

$$\frac{\text{Number of relevant wrong comparisons}}{\text{Total number of wrong comparisons}} = \left(\frac{1}{26}\right)^{R}\left(\frac{25}{26}\right)^{L-R}.$$
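To give a feeling for the size of this fraction, it can be evaluated for the example of the opening paragraph: a tetragramme, two bigrammes and fifteen single letters give R = 4 + 2·2 + 15 = 23 repeated letters in an overlap of L = 105. The sketch below is an illustration only; the function name is an assumption of this example, and L is taken to be the length of the overlap.

```python
from fractions import Fraction


def wrong_relevant_fraction(R, L):
    """Fraction of wrong comparisons showing one given repetition figure
    with R repeated letters in an overlap of L, assuming hatted keys
    (each position repeats independently with probability 1/26)."""
    return Fraction(1, 26) ** R * Fraction(25, 26) ** (L - R)


# Example from the text: tetragramme, two bigrammes, fifteen single letters.
R = 4 + 2 * 2 + 15
L = 105
p = wrong_relevant_fraction(R, L)
print(f"R = {R}, L = {L}, fraction = {float(p):.3e}")
```

Exact fractions are used so that no rounding enters before the final conversion for printing.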

For the calculation of the number of relevant right comparisons we
have to make other assumptions. The sort of assumption that we need
is that a repetition in one place is not made any the more or less
likely by a repetition elsewhere. Actually this assumption would not
be quite true, as it clearly does not hold in the case of adjacent
letters. For practical purposes I think the following assumption is
sufficiently near to the truth:

If we know that at a certain point P there is not a 
repetition, then knowledge that there is or is not 
a repetition at a point A before P does not make a 
repetition at a point B after P either more likely 
or less likely. The probability of a repetition at 
any point is also independent of its distance from 
the end of either message. 

With these assumptions we get the right distribution of comparisons
among the various repetition figures if we suppose the repetition
figures of the comparisons to be constructed in the following way.
We are given an urn containing a large number of cards, some bearing
the words no repeat, some bearing simple repeat, some bigramme, some
trigramme, and so on. To construct a random sample of repetition
figures of comparisons of given length we make a series of draws from
the urn.

The first few draws determine the repetition figure for the first
comparison, the next few for the next comparison, and so on. When we
draw no repeat we have to add an ‘O’ to the repetition figure, when
we draw simple repeat we add ‘XO’, for bigramme we add ‘XXO’, and so
on. When we have got to the length of overlap required the
comparison is completed and our next draws refer to the next
comparison. If it happens that the right length is never reached
because we ‘jump past it’ then we scrap that comparison, and go on to
the next. As an example suppose that we are making comparisons with
an overlap of 12, and that our first draws are tetragramme, no rep,
no rep, no rep, bigramme, no rep, trigramme, 13-gramme, then no rep
13 times; our first two comparisons will have the repetition figures:

XXXXOOOOXXOO

OOOOOOOOOOOO

the one starting ‘XXXO’ being rejected because we never reach the
right length of overlap. (This arrangement requires that every
repetition figure should end with ‘O’, and therefore the genuine
repetition figure should be obtained by crossing this off; but I
shall not be too meticulous about details arising from the ends of
the comparison.)

The number of draws required to produce a given figure is the number
of non-repeating letters, i.e. the overlap less the number of
repeating letters. With our convention about crossing off the last
letter we have to add 1.
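The drawing procedure can be sketched in Python as follows. This is a minimal illustration, not part of the original method: the function names are assumptions, each draw is coded as an integer r (0 for no repeat, 1 for simple repeat, 2 for bigramme, and so on), and, as in the worked example, the closing ‘O’ is counted within the overlap so that end details are ignored.

```python
def repetition_figures(draws, overlap):
    """Turn a sequence of card draws into completed repetition figures.

    A card of value r appends r X's followed by one O. A figure is kept
    when it reaches exactly `overlap` symbols and scrapped when a card
    makes it jump past that length.
    """
    figures = []
    current = ""
    for r in draws:
        current += "X" * r + "O"
        if len(current) == overlap:
            figures.append(current)  # comparison completed
            current = ""
        elif len(current) > overlap:
            current = ""             # jumped past the length: scrap it
    return figures


def draws_needed(figure):
    """Number of draws that built a figure: one per non-repeating letter."""
    return figure.count("O")


# The draws of the example: tetragramme, three no reps, bigramme, no rep,
# trigramme, 13-gramme, then no rep thirteen times.
draws = [4, 0, 0, 0, 2, 0, 3, 13] + [0] * 13
figures = repetition_figures(draws, overlap=12)
for figure in figures:
    print(figure, draws_needed(figure))
```

The comparison begun by the trigramme is dropped when the 13-gramme overshoots, and the thirteenth no rep is left over as the start of a third comparison.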

Two problems arise from this picture:

(1) How do we calculate the correct proportions of cards in the urn?

(2) Given the proportion of the cards in the urn, how do we
    calculate the number of right relevant comparisons, and hence
    the probability of a given fit?

The correct proportion of the cards in the urn can be calculated from
the actual distribution of repetitions in the case of messages
correctly set, or, what comes to the same thing, in messages
unenciphered and arbitrarily set. Let us suppose that we have a large
number of such comparisons of unenciphered messages, and that the
messages are sufficiently long that complications arising from the
ends of the messages can be neglected.

The proportion of cards bearing the words simple repeat, bigramme,
trigramme, etc., must obviously be in the same ratio as the number of
corresponding repeats in our comparisons. The number of no repeat
cards will be calculated slightly differently as we have to subtract
one case of no repeat for each sequence of repeating letters.

To get the best value from given material we naturally make every
possible comparison. If we do this the right number of repetitions
can be calculated quite easily without actually making the
comparisons. Theoretically we can imagine the complete set of
comparisons made in this way.

First of all we write out all the decodes (say 50 of them) one after
another round a circle; suppose that the number of letters on this
circle is N. The whole is then repeated on a concentric circle. All
possible comparisons can be made by rotating the one circle with
respect to the other. From these we have to remove the comparison in
which the circles are not rotated at all, for obvious reasons. Also
when the rotation is more than 180° we get essentially the same
comparison as one with less than 180°. The net effect of this, taking
into account also the special case of exact 180° rotation, is that
the total overlap of all the comparisons is:

$$\frac{N(N-1)}{2}.$$

Now let us consider for example the total number of tetragramme
repeats in all these comparisons. These can be divided into repeats
arising from AAAA, those from AAAB, ..., those from ZZZZ, the largest
contribution arising presumably from such tetragrammes as EINS. The
number of tetragrammes arising from EINS consists of the number of
pairs of hexagrammes such as QEINSR, VEINSW in which the first
letters of each are different, the last different, and the remainder
spell EINS. This number of pairs we will call the actual number of
tetragramme repeats arising from EINS.

The actual number of tetragramme repeats is obtained by summing over
AAAA, AAAB, ..., EINS, ..., ZZZZ. This actual number is not easily
calculated directly, but we can more easily obtain the apparent
number of tetragramme repeats, and this leads to the actual number.
The apparent number of tetragramme repeats arising from EINS is
defined to be the number of pairs of occurrences of EINS in the
material, and the apparent number of tetragramme repeats is defined
by summation.

We can also define the apparent number of tetragramme repeats in a
comparison as the number of different series XXXX in the comparison.
Thus a heptagramme repeat gives four apparent tetragramme repeats.
The actual number of repeats can be calculated from the apparent in
this way.

Let $M_r$ be the apparent number of r-grammes, and $N_r$ the actual
number.

Then

$$M_r = N_r + 2N_{r+1} + 3N_{r+2} + \cdots,$$

so that

$$M_r - M_{r+1} = N_r + N_{r+1} + N_{r+2} + \cdots,$$

and

$$N_r = (M_r - M_{r+1}) - (M_{r+1} - M_{r+2}) = M_r - 2M_{r+1} + M_{r+2}.$$
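The second difference is easily mechanised. The sketch below is illustrative only (the function name and the dictionary layout are assumptions); it checks the formula on a single isolated heptagramme repeat, which contains 8 - r apparent r-grammes for each r up to 7 and should therefore yield one actual heptagramme and nothing else.

```python
def actual_from_apparent(apparent, max_r):
    """Return the actual counts N_r for r = 1..max_r.

    `apparent` maps r to the apparent count M_r and must extend to
    max_r + 2, since N_r = M_r - 2*M_{r+1} + M_{r+2}.
    """
    return {
        r: apparent[r] - 2 * apparent[r + 1] + apparent[r + 2]
        for r in range(1, max_r + 1)
    }


# One heptagramme repeat on its own: M_r = 8 - r for r <= 7, zero beyond.
apparent = {r: max(8 - r, 0) for r in range(1, 10)}
actual = actual_from_apparent(apparent, max_r=7)
print(actual)
```

Note that the apparent counts have to be supplied two stages beyond the longest actual repeat required.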

It is therefore sufficient to calculate only apparent numbers and to
carry these two stages further than we want to go with the actual
numbers. In practice octagramme repeats are so certain to be right
that it will be sufficient to have statistics only as far as
heptagrammes. We therefore need statistics of apparent numbers of
repeats as far as 9-grammes. To get these numbers of apparent repeats
it is sufficient to take all the 9-grammes in the material (i.e. on
the circle) and to put them into alphabetical order. This can be done
very conveniently by Hollerith.

The number of trigramme repeats say can then be found very simply
(although with a good deal of labour) by considering only the first
three letters of each 9-gramme. Suppose we denote by t a typical
trigramme and by $n_t$ the number of its occurrences, then the
apparent number of trigramme repeats is

$$\sum_t \frac{n_t(n_t - 1)}{2}.$$
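A short Python sketch of this count, with a tally standing in for the Hollerith sort, is given below. It is an illustration under stated assumptions: the function name and the toy text are invented for the example, and the material is treated as written round a circle, so that r-grammes running past the end continue from the beginning.

```python
from collections import Counter


def apparent_repeats(text, r):
    """Apparent number of r-gramme repeats in `text` taken as a circle:
    the number of pairs of occurrences of the same r-gramme."""
    n = len(text)
    circular = text + text[: r - 1]  # let r-grammes wrap round the circle
    counts = Counter(circular[i : i + r] for i in range(n))
    return sum(c * (c - 1) // 2 for c in counts.values())


toy = "ABCABCXY"  # hypothetical material, 8 letters on the circle
apparent = {r: apparent_repeats(toy, r) for r in range(1, 6)}
print(apparent)
```

For real material one would tally the 9-grammes once and obtain the shorter counts from their leading letters, as the text describes; the sketch simply recounts for each r.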


When calculating the proportion of cards in the urn we must remember
that the total number of cards is not

$$\frac{N(N-1)}{2},$$

but is less than this by

$$\sum_r r N_r.$$
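The bookkeeping for the urn can be sketched as follows. The function name and the toy counts are assumptions made for illustration; the rule used is that an r-gramme card accounts for r + 1 letters of overlap and a no repeat card for one, so that one no repeat is subtracted for each sequence of repeating letters.

```python
from fractions import Fraction


def urn_proportions(circle_length, actual):
    """Proportions of cards in the urn from the actual repeat counts.

    `actual` maps r to N_r. Returns (a, a_no_repeat), where a[r] is the
    proportion of r-gramme cards.
    """
    total_overlap = circle_length * (circle_length - 1) // 2
    total_cards = total_overlap - sum(r * n for r, n in actual.items())
    no_repeat_cards = total_cards - sum(actual.values())
    a = {r: Fraction(n, total_cards) for r, n in actual.items()}
    return a, Fraction(no_repeat_cards, total_cards)


# Toy case: 8 letters on the circle with a single actual trigramme repeat.
a, a_no_repeat = urn_proportions(8, {3: 1})
print(a, a_no_repeat)
```

In the toy case the 28 letters of total overlap are made up of 24 no repeat cards and one trigramme card of four letters, giving 25 cards in all.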

In our later calculations it is convenient to regard the comparisons
in wrong places as also constructed by drawing from an urn. In this
case we easily see that the apparent number of r-grammes is

$$\left(\frac{N(N-1)}{2}\right)\left(\frac{1}{26^{r}}\right),$$

and from this we deduce that the actual proportion of r-gramme cards
is

$$\frac{25}{26^{r+1}},$$

and of no repeat cards is

$$\frac{25}{26}.$$


We now turn to the problem of calculating the probability of a given
fit when we know the proportion $a_r$ of r-gramme cards that are in
the urn for each r. The calculation is going to be slightly
complicated by the convention which we introduced, that not all
drawings can lead to a comparison. We have therefore to calculate the
proportion of draws which do lead to a comparison, i.e. in which the
length does not overshoot the mark. The answer is that as the length
of the overlap tends to infinity the proportion tends to

$$\frac{1}{1+\sum_r r a_r};$$

in the case of hatted material this is 25/26.
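The hatted value can be checked by computing exactly the chance that a run of draws lands on a given length rather than jumping past it. The sketch below is one reading of the statement, not the original calculation: it assumes the hatted urn holds no repeat cards in proportion 25/26 and r-gramme cards in proportion 25/26^(r+1), that a card bearing r repeats advances the figure by r + 1 letters, and the function name is invented for the example.

```python
from fractions import Fraction


def reach_probability(step_probs, length):
    """Chance that successive draws hit `length` exactly.

    `step_probs` maps a step size s (letters added by one card) to the
    proportion of such cards. u[n] is the chance of landing on n.
    """
    u = [Fraction(1)] + [Fraction(0)] * length
    for n in range(1, length + 1):
        u[n] = sum(
            (p * u[n - s] for s, p in step_probs.items() if s <= n),
            Fraction(0),
        )
    return u[length]


L = 30
# Hatted urn: a card with r repeats (r = 0 is no repeat) adds r + 1 letters.
hatted_steps = {r + 1: Fraction(25, 26 ** (r + 1)) for r in range(0, L)}
p_reach = reach_probability(hatted_steps, L)
print(p_reach)
```

Steps longer than the length considered cannot contribute, so truncating the urn at r = L - 1 loses nothing here.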

Now we put $\lambda = 1 - \sum_r a_r$, the proportion of no repeat
cards. Consider a repetition figure in which there are $k_r$
r-grammes. Let the overlap be L. The number of no repeat cards drawn
is $L + 1 - \sum_r (r+1)k_r$. The proportion of right draws which are
relevant is

$$\lambda^{L+1-\sum_r (r+1)k_r}\prod_{r=1}^{\infty} a_r^{k_r},$$

and then the proportion of the right comparisons which are relevant
is (assuming L reasonably large)

$$\left(1+\sum_r r a_r\right)\lambda^{L+1-\sum_r (r+1)k_r}\prod_{r=1}^{\infty} a_r^{k_r}.$$

Similarly, calculating with the urn whose proportions were made up
from hatted material, we find for the proportion of wrong comparisons
which are relevant

$$\frac{26}{25}\left(\frac{25}{26}\right)^{L+1-\sum_r (r+1)k_r}\prod_{r=1}^{\infty}\left(\frac{25}{26^{r+1}}\right)^{k_r}.$$


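Hence the odds* on our fit are

$$q = A\,\frac{25\left(1+\sum_r r a_r\right)}{26}\left(\frac{26\left(1-\sum_r a_r\right)}{25}\right)^{L+1-\sum_r (r+1)k_r}\prod_{r=1}^{\infty}\left(\frac{26^{r+1}a_r}{25}\right)^{k_r},$$

where A is the a priori odds. This is most conveniently written as:

$$\log q = \log A + \sum_r \mu_r k_r - \nu L + \log\left[\left(1-\sum_r a_r\right)\left(1+\sum_r r a_r\right)\right],$$

where

$$\nu = \log\frac{25}{26\left(1-\sum_r a_r\right)}, \qquad \mu_r = \log\frac{26^{r+1}a_r}{25} + (r+1)\nu.$$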
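The scoring rule can be sketched in Python as below. This is a hedged illustration rather than the original procedure: the function name and argument layout are assumptions, natural logarithms are used, and the coefficients are taken to be ν = log(25 / (26(1 - Σ a_r))) and μ_r = log(26^(r+1) a_r / 25) + (r+1)ν, which is what comparing the logarithmic form with the product form requires. No plain-language proportions are supplied here; the only case run is the hatted urn, for which a fit should carry no evidence and the odds should stay at their a priori value.

```python
import math


def log_odds(prior_odds, overlap, k, a):
    """Log odds on a fit.

    prior_odds: the a priori odds A.
    overlap:    the overlap L.
    k:          maps r to the number of r-gramme repeats in the fit.
    a:          maps r to the proportion of r-gramme cards in the urn.
    """
    no_repeat = 1 - sum(a.values())
    excess = sum(r * p for r, p in a.items())
    nu = math.log(25 / (26 * no_repeat))
    total = math.log(prior_odds) + math.log(no_repeat * (1 + excess))
    total -= nu * overlap
    for r, count in k.items():
        mu_r = math.log(26 ** (r + 1) * a[r] / 25) + (r + 1) * nu
        total += mu_r * count
    return total


# Repetition figure of the opening example: fifteen single letters,
# two bigrammes and a tetragramme in an overlap of 105.
k = {1: 15, 2: 2, 4: 1}
hatted = {r: 25 / 26 ** (r + 1) for r in range(1, 13)}
score = log_odds(prior_odds=1.0, overlap=105, k=k, a=hatted)
print(score)
```

With proportions taken from plain-language statistics the same function gives a score per repeat of each length and a penalty per letter of overlap, which is the form suited to quick hand calculation.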



In the case of overlap zero there is a discrepancy of

$$\log\left[\left(1-\sum_r a_r\right)\left(1+\sum_r r a_r\right)\right],$$

due to the overlap not being long. This term is in any case
microscopic.


* The odds on an event are defined to be the probability of the
event divided by the probability of its negation.















